Spark 46959 by azmatsiddique · Pull Request #54812 · apache/spark

azmatsiddique · 2026-03-15T07:36:25Z

What changes were proposed in this pull request?

This PR fixes an issue where the CSV reader inconsistently parses empty quoted strings ("") when the escape option is set to an empty string ("").

Previously, if escape="" was used, mid-line empty quoted strings were correctly resolved to an empty string, but an empty quoted string in the last column resolved to a literal " character. This occurred because Spark maps escape="" to \u0000, and univocity's parser uses that value as its escape character. At the end of a line, where no delimiter follows the field, univocity misinterprets the second " as an escaped quote rather than a closing quote.

The fix introduces a post-processing step in UnivocityParser.parseLine to detect this specific condition (a single quote character as the last token when escape is \u0000) and replace it with the configured emptyValueInRead.
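
As a rough illustration of the post-processing step described above, the following Scala sketch shows the check in isolation. It is not the code from this PR: the object name, fixLastToken, and its parameters are hypothetical, and the real change lives inside UnivocityParser.parseLine.

```scala
// Hypothetical, standalone sketch of the post-processing idea.
// None of these names come from the Spark code base.
object EmptyQuotedLastTokenSketch {
  // Spark represents escape="" as the NUL character.
  val NoEscape: Char = '\u0000'

  /** Replaces a lone quote in the last position with the configured empty value.
    *
    * @param tokens           tokens produced by the CSV parser for one line
    * @param quote            configured quote character
    * @param escape           configured escape character (NUL when escape="")
    * @param emptyValueInRead value that an empty quoted field should resolve to
    */
  def fixLastToken(
      tokens: Array[String],
      quote: Char,
      escape: Char,
      emptyValueInRead: String): Array[String] = {
    // Only the last token is affected, and only when no escape character is set.
    // Scala's == is null-safe, so a null last token simply fails the check.
    if (tokens.nonEmpty && escape == NoEscape && tokens.last == quote.toString) {
      tokens.updated(tokens.length - 1, emptyValueInRead)
    } else {
      tokens
    }
  }
}
```

Tokens in other positions, and lines parsed with a non-empty escape character, pass through unchanged in this sketch.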

Why are the changes needed?

To ensure consistent parsing of CSV data regardless of whether an empty quoted string appears in the middle of a line or at the end of a line.

Does this PR introduce any user-facing change?

Yes, it fixes a bug where users were receiving incorrect data (a literal quote instead of an empty/null value) for the last column in a row under specific CSV configurations (e.g. escape="", quote="\"", sep=";").

How was this patch tested?

Added a new regression test in CSVSuite.scala:
"SPARK-46959: CSV reader reads data inconsistently depending on column position"

The test verifies that an empty quoted string behaves identically in the mid-line position (column c) and the end-of-line position (column d) when configured with escape="" and nullValue="".
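
The scenario can be sketched with the public DataFrameReader API, assuming a local SparkSession is available. The sample rows, the column names a to d, and the helper name readSample are assumptions chosen to mirror the description above; the actual test data in CSVSuite.scala may differ.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical reproduction helper; not the regression test from this PR.
object Spark46959ReproSketch {
  def readSample(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Assumed sample data: column c holds a mid-line empty quoted string,
    // column d holds an end-of-line empty quoted string.
    val lines = Seq(
      "a;b;c;d",
      "1;2;\"\";\"\""
    ).toDS()

    spark.read
      .option("header", "true")
      .option("sep", ";")
      .option("quote", "\"")
      .option("escape", "")    // empty escape, the configuration that triggers the bug
      .option("nullValue", "")
      .csv(lines)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-46959-sketch")
      .getOrCreate()
    try {
      // Compare columns c and d of the single data row by eye.
      readSample(spark).show()
    } finally {
      spark.stop()
    }
  }
}
```

On a build that includes the fix, columns c and d of the data row are expected to hold the same value.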

Verified that both CSVv1Suite and CSVv2Suite pass without regressions.

Was this patch authored or co-authored using generative AI tooling?

No


azmatsiddique added 3 commits March 14, 2026 22:48

[SPARK-55968][SQL] Do not treat vectorized reader capacity overflow as corrupted file

ad9573a

Trigger Github Actions

b127ca2

azmatsiddique wants to merge 3 commits into apache:master from azmatsiddique:SPARK-46959
